Automated decision making using time-varying stream reliability prediction

ABSTRACT

Automated decision making techniques are provided. For example, a technique for generating a decision associated with an individual or an entity includes the following steps. First, two or more data streams associated with the individual or the entity are captured. Then, at least one time-varying measure is computed in accordance with the two or more data streams. Lastly, a decision is computed based on the at least one time-varying measure. One form of the time-varying measure may include a measure of the coverage of a model associated with previously-obtained training data by at least a portion of the captured data. Another form of the time-varying measure may include a measure of the stability of at least a portion of the captured data. While either measure may be employed alone to compute a decision, preferably both the coverage and stability measures are employed. The technique may be used to authenticate a speaker.

FIELD OF THE INVENTION

[0001] The present invention relates to automated decision making techniques such as speaker authentication and, more particularly, to techniques for generating such decisions using time-varying stream reliability prediction in accordance with multiple data streams.

BACKGROUND OF THE INVENTION

[0002] The decision making process of authenticating (e.g., recognizing, identifying, verifying) a speaker is an important step in ensuring the security of systems, networks, services and facilities, both for physical and for logical access. However, accurate speaker authentication is also a goal in applications other than secure access-based applications.

[0003] Some existing automated speaker authentication techniques rely exclusively on an audio stream captured from the speaker being authenticated. However, it is known that, even in the case of clean speech (e.g., speech collected over a high quality microphone in an environment with little noise), as opposed to the case of degraded speech (e.g., speech collected over noisy phone lines or in environments with substantial background noise and distortion), there exists a subset of the population for which audio-based authentication is problematic and inconsistent. For example, in G. Doddington et al., “Sheep, Goats, Lambs and Wolves: An Analysis of Individual Differences in Speaker Recognition Performance,” NIST Presentation at ICSLP98, Sydney, Australia, November 1999, it is shown that there are speakers, termed “goats,” who are difficult to recognize based on their voice. Speakers who are readily recognized based on voice are termed “sheep.”

[0004] Thus, other existing automated speaker authentication techniques have adopted an approach wherein, in addition to the use of the audio stream, a video stream representing the speaker is taken into account, in some manner, in making the speaker authentication decision.

[0005] In accordance with such two-stream systems, one may manually choose to make an a priori decision as to the efficacy of audio data versus video data for each individual and subsequently use only the data corresponding to the most effective modality.

[0006] Another option is to model the joint statistics of the data streams. However, a more flexible option is to create models independently for each data modality and utilize scores and decisions from both. Previous studies utilizing independent models, such as those detailed in B. Maison et al., “Audio-Visual Speaker Recognition for Video Broadcast News: Some Fusion Techniques,” IEEE Multimedia Signal Processing (MMSP99), Denmark, September 1999, have been applied only at the test utterance level in the degraded speech case.

SUMMARY OF THE INVENTION

[0007] The present invention provides improved, automated decision making techniques. As will be explained herein, such techniques preferably employ a time-varying stream reliability prediction methodology in accordance with data obtained from multiple data streams associated with an individual or entity. Advantageously, the improved decision making techniques of the invention provide a higher degree of accuracy than is otherwise achievable with existing approaches.

[0008] In one aspect of the invention, a technique for generating a decision associated with an individual or an entity includes the following steps. First, two or more data streams associated with the individual or the entity are captured. Then, at least one time-varying measure is computed in accordance with the two or more data streams. Lastly, a decision is computed based on the at least one time-varying measure.

[0009] One form of the time-varying measure may include a measure of the coverage of a model associated with previously-obtained training data by at least a portion of the captured data. The coverage measure may be determined in accordance with an inverse likelihood computation. The inverse likelihood computation may include modeling, for a time t, a neighborhood of a test vector associated with the captured data to generate a test data model, and measuring the likelihood of one or more parameters of the training data model with respect to the test data model. Also, the feature space associated with the test data model may be transformed into the feature space associated with the training data model.

[0010] Another form of the time-varying measure may include a measure of the stability of at least a portion of the captured data. The stability measure may be determined in accordance with a deviation computation. The deviation computation may include computing a deviation of a score at time t from a point estimate of the score at time t based on a neighborhood of test vectors associated with the captured data.

[0011] While either measure may be employed alone to compute a decision, preferably both the coverage and stability measures are employed.

[0012] Furthermore, in one illustrative embodiment of the invention, a time-varying, context dependent information fusion methodology may be provided for multi-stream authentication based on audio and video data collected during a user's interaction with a system implementing the methodology. Scores obtained from the two data streams may be combined based on the relative local richness (e.g., coverage), as compared to the training data or derived model, and stability of each stream. Then, an authentication decision may be made based on these determinants.

[0013] Results show that the techniques of the invention may outperform the use of video or audio data alone, as well as the use of fused data streams (via concatenation). Of particular note is that the performance improvements may be achieved for clean, high quality speech, whereas previous efforts focused only on degraded speech conditions.

[0014] These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a block diagram illustrating an authentication system and an environment in which the system may be used, according to an embodiment of the present invention;

[0016] FIG. 2 is a diagram illustrating behavior of a coverage parameter computed by an authentication system, according to an embodiment of the present invention;

[0017] FIG. 3 is a diagram illustrating behavior of a stability parameter computed by an authentication system, according to an embodiment of the present invention;

[0018] FIG. 4 is a block/flow diagram illustrating an authentication methodology for implementation in an authentication system, according to an embodiment of the present invention;

[0019] FIG. 5 is a block diagram illustrating an exemplary computing system environment for implementing an authentication system, according to an embodiment of the invention;

[0020] FIG. 6 is a diagram tabularly illustrating authentication rates obtained in experiments conducted in accordance with the invention; and

[0021] FIG. 7 is a diagram tabularly illustrating performance rates obtained in experiments conducted in accordance with the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0022] The following description will illustrate the invention using exemplary data streams associated with a speaker, such as an audio data stream and a video data stream. An embodiment where another data stream (e.g., a data stream representing global positioning system (GPS) data) is employed will also be described. It should be understood, however, that the invention is not limited to use with any particular type of data streams or any particular type of decision. The invention is instead more generally applicable for use with any data that may be associated with an individual or entity, such that a decision associated with the individual or entity may be accurately computed via multiple data streams using one or both of the time-varying measures of the invention. Also, by use herein of the phrase “multiple data streams,” it is generally meant that the invention can operate with two or more data streams. Further, while the description focuses on an individual (e.g., speaker), the techniques of the invention are equally applicable for use with entities. It is to be appreciated that an entity can, for example, be a group of individuals, automated agents, etc.

[0023] Referring initially to FIG. 1, a block diagram illustrates an authentication system and an environment in which the system may be used, according to an embodiment of the present invention. More specifically, the arrangement 100 in FIG. 1 illustrates a vehicle 102, a speaker 110, a video channel 112, an audio channel 114, an additional data channel 116, an authentication system 118 and an application 120.

[0024] Thus, in the illustrative embodiment shown, multiple data streams are obtained from a speaker 110 located in vehicle 102 via the video, audio and additional data channels 112, 114 and 116, respectively.

[0025] Audio data may be obtained via one or more microphones and audio processing equipment positioned in the vehicle 102 for capturing and processing the individual's spoken utterances such that an audio data stream representative of the utterances is produced. An audio stream is provided to the authentication system 118 via audio channel 114. The vehicle may also be equipped with audio processing equipment and an audio output device to enable audio prompts to be presented to the individual to evoke certain audio responses from the individual.

[0026] Further, video data may be obtained via one or more video cameras or sensors and video processing equipment positioned in the vehicle 102 for capturing and processing images of the individual such that a video data stream representative of images associated with the individual is produced. A video stream is provided to the authentication system 118 via video channel 112.

[0027] Still further, where additional data (in this case, data other than audio and video data) is being used by the authentication system, the additional data may be obtained directly from a data source. For example, where GPS data is being used, the authentication system 118 may obtain the GPS data via the data channel 116 from a GPS device (not shown) in the vehicle 102, or the GPS data may be obtained via another source. It is to be understood that the GPS data may be used to assist in authentication by providing an indication of where the individual is located.

[0028] It is to be understood that more types of data may be obtained by the authentication system 118 via additional channels.

[0029] The specific operations of the authentication system 118 will be described in detail below and with reference to FIGS. 2-5. In general, the authentication system 118 authenticates a speaker using time-varying stream reliability prediction, as will be described in detail below, based on data obtained from multiple data streams.

[0030] The authentication decision determined by the authentication system 118 is provided to the application 120. If the decision is that the speaker 110 is authenticated, then the speaker is given access to the application 120. If the speaker is not authenticated, then the speaker is not given access to the application 120 (or is given limited access). By “access to an application,” it is generally meant that the speaker may then interact with some network, service, facility, etc. The invention is not limited by the type of secure access-based application in any way. Further, the invention is not limited to use with any particular type of application.

[0031] It is to be understood that all of the components shown in and referred to in the context of FIG. 1 may be implemented in vehicle 102. However, this is not a necessity. That is, the authentication system 118 and/or application 120 may be located remote from the vehicle 102. For example, the authentication system 118 and/or application 120 may be implemented on one or more servers in communication (wired and/or wireless) with the data capture devices (e.g., microphones, video cameras, GPS devices, etc.) positioned in the vehicle 102. Also, as mentioned above, it is to be understood that the invention is not intended to be limited to use in the particular arrangement (e.g., vehicle-based arrangement) illustrated in FIG. 1.

[0032] Given the illustrative arrangement described above, time-varying stream reliability prediction techniques for use by an authentication system according to the present invention will now be explained in detail below.

[0033] In general, user authentication based on a speaker's interaction with any system is based on the information that the system collects. The present invention realizes that it is possible and desirable to exploit multiple forms of information obtained by different transducers (data capture devices). Preferably, the transducers are operating simultaneously. In particular, the invention realizes that a significant benefit can be obtained by analyzing the correlation in the multiple data streams. More significantly, however, any subset of the data streams forms a context for the analysis of any other subset of streams, allowing the formulation of a robust time-varying fusion methodology.

[0034] In the illustrative embodiment to follow, it is to be understood that only audio and video data streams are used. Recall however that FIG. 1 shows that additional data (e.g., GPS data) may be employed in the methodology. The addition of this other data type is straightforward given the following detailed explanation of how audio and video data streams are processed to make an authentication decision. Nonetheless, the use of a third data type will be explained in the context of FIG. 4.

[0035] The present invention provides a time-varying approach to multi-stream analysis where, in general, the fusion of data, scores, and decisions occurs locally in time at the stream element level and relative to a local “context” that includes a measure of data richness and data consistency. Herein, a combination of these two properties (i.e., data richness and data consistency) may be generally referred to as “reliability.” The present invention realizes that the reliability of each data stream varies with time as the interaction proceeds and that performance gains can be achieved by using this knowledge intelligently.

[0036] As mentioned above, even in the clean speech case, there exists a subset of the population for which audio-based authentication is problematic and inconsistent. Thus, the present invention provides methods for predicting the reliability of the data streams. It may turn out for any point in time that only audio, only video, or a combination of the two streams is used. This combining process is time varying in that the reliability is modeled locally as a function of time and other signal properties.

[0037] Thus, as will be seen, the invention realizes that an optimal approach to speaker authentication is to model the multiple data streams independently and make an intelligent, time-varying decision as to which model to use and/or how to combine scores for each point in time. The results presented herein are significant in that improvement in speaker recognition performance may be obtained for the high quality, clean speech case (as evidenced by overall audio performance) by adding, for example, video stream data and performing a time-varying analysis.

[0038] A description of the data streams or feature streams that may be employed in accordance with an illustrative embodiment of the invention will now be provided. With reference back to FIG. 1, it is to be understood that the video stream and the audio stream are provided on video and audio channels 112 and 114, respectively. A vector-wise concatenation of the audio and video streams is also generated from the two streams. It is to be appreciated that such feature streams may be generated in the processing equipment associated with the input devices that captured the raw data, or the feature streams may be generated in the authentication system 118 once the authentication system receives the raw data streams from the input devices.

[0039] In any case, simultaneous recordings of audio and video data are used to produce three vector streams of interest: X^(a)={x_(t) ^(a)} (audio), X^(v)={x_(t) ^(v)} (video), and X^(av)={x_(t) ^(av)} (vector-wise concatenation of audio and video streams).

[0040] In particular, the audio stream may preferably include mean normalized, 23 dimensional, Mel frequency cepstral coefficient (MFCC) vectors (no C0) computed using 24 filters. Of course, other audio feature vectors may be employed.

[0041] Further, visual features may preferably be extracted using an appearance-based technique. For example, the appearance-based techniques of G. Potamianos et al., “A Cascade Visual Front End for Speaker Independent Automatic Speechreading,” International Journal of Speech Technology, vol. 4(3-4), pp. 193-208, 2001, the disclosure of which is incorporated by reference herein, may be employed. For each video frame, a statistical face tracking methodology is used to define a region of interest to which a two-dimensional, separable, discrete cosine transform (DCT) is applied. The 24 highest energy (over all training data) DCT coefficients are retained and mean normalization is applied to compensate for lighting variations. No delta parameters are used. Of course, other video feature vectors may be employed.

[0042] The audio and video vector coefficients may further be processed via short-time Gaussianization, which attempts to mitigate the effects on the mean and variance parameters of linear channel and additive noise distortions by locally mapping the features to the standard normal distribution. For example, the Gaussianization techniques described in B. Xiang et al., “Short-Time Gaussianization for Robust Speaker Verification,” Proceedings of ICASSP, Orlando, Fla., May 2002, the disclosure of which is incorporated by reference herein, may be employed.
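The local mapping to a standard normal distribution can be pictured with a small rank-based sketch. This is only an illustration of the general idea, not the procedure of the cited reference: the function name `short_time_warp`, the window size, and the mid-rank quantile rule are all assumptions introduced here, and the sketch handles a single coefficient track rather than full feature vectors.

```python
from statistics import NormalDist


def short_time_warp(values, half_window=1):
    """Map each value to a standard-normal quantile using its rank in a local window.

    Assumed simplification: one coefficient track, a symmetric window that is
    truncated at the ends, and a mid-rank quantile rule.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    warped = []
    for t, current in enumerate(values):
        lo = max(0, t - half_window)
        hi = min(len(values), t + half_window + 1)
        window = values[lo:hi]
        # Rank of the current value = number of strictly smaller neighbours.
        rank = sum(1 for v in window if v < current)
        # The +0.5 offset keeps the quantile strictly inside (0, 1).
        quantile = (rank + 0.5) / len(window)
        warped.append(std_normal.inv_cdf(quantile))
    return warped


coefficient_track = [0.1, 0.4, 0.9, 1.5, 2.2]
print(short_time_warp(coefficient_track))
```

Because only the rank inside the window matters, a constant offset or scale change applied to the whole track leaves the warped values unchanged, which is the kind of channel robustness the paragraph above describes.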

[0043] A description of the speaker models that may be employed in accordance with an illustrative embodiment of the invention will now be provided.

[0044] Speaker modeling is preferably based on a Gaussian Mixture Model (GMM) framework, for example, as described in D. A. Reynolds et al., “Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models,” IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, January 1995, the disclosure of which is incorporated by reference herein. The speaker modeling may also include transformation-based enhancements described in U. V. Chaudhari et al., “Transformation Enhanced Multi-grained Modeling for Text-Independent Speaker Recognition,” Proceedings of the International Conference on Spoken Language Processing (ICSLP), Beijing, October 2000, the disclosure of which is incorporated by reference herein, which use feature space optimizations on top of the initial feature sets.

[0045] These optimizations, via Maximum Likelihood Linear Transformation (MLLT) as described in R. A. Gopinath, “Maximum Likelihood Modeling with Gaussian Distributions for Classification,” Proceedings of ICASSP, Seattle, May 1998, the disclosure of which is incorporated by reference herein, are conditioned on the models, which must therefore be built before the optimization. For each data stream s, and speaker j, the N_(s) ^(j)-component model, M_(s) ^(j,o), is parameterized, prior to the feature space optimization, by $\left\{ m_{s,i}^{j,o},\ \Sigma_{s,i}^{j,o},\ p_{s,i}^{j} \right\}_{i = 1,\ldots,N_{s}^{j}},$

[0046] including the estimates of the mean, covariance, and mixture weight parameters. Restriction to diagonal covariance models occurs in a transformed feature space where an MLLT transformation T_(s) ^(j) is chosen, via a gradient descent, to minimize the loss in likelihood that results from the restriction.

[0047] Consequently, the new model parameterization is: $M_{s}^{j} = T_{s}^{j} M_{s}^{j,o} \equiv \left\{ m_{s,i}^{j},\ \Sigma_{s,i}^{j},\ p_{s,i}^{j} \right\}_{i = 1,\ldots,N_{s}^{j}},$

[0048] where $m_{s,i}^{j} = T_{s}^{j} m_{s,i}^{j,o}$ and $\Sigma_{s,i}^{j} = \mathrm{diag}\left( T_{s}^{j}\, \Sigma_{s,i}^{j,o}\, T_{s}^{j,T} \right).$

[0049] Note that the feature space optimization is carried out independently for each speaker model and each feature stream. As a result, each speaker model has its own associated feature space.
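As a small worked illustration of the re-parameterization above, the following sketch applies a given transform T to one mixture component: the mean is mapped to T m and the covariance is replaced by the diagonal of T Σ T^T. The helper names are hypothetical, the transform is simply supplied as input (the gradient-descent estimation of T is not shown), and plain Python lists stand in for matrices.

```python
def mat_vec(T, v):
    """Multiply matrix T (list of rows) by vector v."""
    return [sum(T[r][c] * v[c] for c in range(len(v))) for r in range(len(T))]


def transformed_diag_cov(T, cov):
    """Return the diagonal of T * cov * T^T as a list."""
    n = len(T)
    diag = []
    for r in range(n):
        # (T cov T^T)[r][r] = sum over a, b of T[r][a] * cov[a][b] * T[r][b]
        diag.append(sum(T[r][a] * cov[a][b] * T[r][b]
                        for a in range(n) for b in range(n)))
    return diag


def transform_component(T, mean, cov, weight):
    """Map one component (mean, full covariance, weight) into the transformed space.

    The mixture weight is unchanged; only the mean and covariance are mapped.
    """
    return mat_vec(T, mean), transformed_diag_cov(T, cov), weight


# Toy values chosen only for illustration.
T = [[1.0, 1.0],
     [0.0, 1.0]]
mean_o = [1.0, 2.0]
cov_o = [[2.0, 0.0],
         [0.0, 3.0]]
print(transform_component(T, mean_o, cov_o, 0.5))
```

Since T is chosen separately for every speaker and stream, test vectors have to be mapped by the same T before they are scored against that speaker's model.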

[0050] A description of discriminants with time-varying context-dependent parameters that may be employed in accordance with an illustrative embodiment of the invention will now be provided.

[0051] The invention preferably uses a modified likelihood based discriminant function that takes into account the added transformation described above. Given a set of vectors X^(s)={x_(t) ^(s)} in R^(n) from some stream s, the base discriminant function for any individual stream-dependent target model M_(s) ^(j) is: $\begin{matrix}{d_{s}\left( x_{t}^{s} \middle| M_{s}^{j} \right) = \max_{i}\left\lbrack \log p\left( T_{s}^{j} x_{t}^{s} \middle| m_{s,i}^{j},\ \Sigma_{s,i}^{j},\ p_{s,i}^{j} \right) \right\rbrack,} & (1)\end{matrix}$

[0052] where the index i runs through the mixture components in the model M_(s) ^(j) and p(·) is a multi-variate Gaussian density. The multi-stream input to the identification system is X={X^(a),X^(v)}, a set of two streams with N vectors in each stream.
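A minimal sketch of equation (1) for diagonal-covariance components follows. It rests on two assumptions that the text does not spell out: the mixture weight enters each component score as an added log weight, and each component is supplied as a (mean, variance list, weight) tuple. The function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def base_discriminant(x, T, components):
    """Score one frame: transform x by T, then keep the best-scoring component.

    components: list of (mean, variances, weight) tuples in the transformed space.
    Assumption: the weight contributes log(weight) to the component score.
    """
    y = [sum(T[r][c] * x[c] for c in range(len(x))) for r in range(len(T))]
    return max(math.log(w) + log_gaussian_diag(y, m, v) for m, v, w in components)


identity = [[1.0]]
toy_model = [([0.0], [1.0], 0.5), ([5.0], [1.0], 0.5)]
print(base_discriminant([0.0], identity, toy_model))
```

Taking the maximum over components, rather than summing the weighted densities, means each frame is scored only by the component that explains it best.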

[0053] Thus, a general discriminant is defined with time-varying parameters for an N frame input as (t∈{1, . . . , N} and s∈{a, v}): $\begin{matrix}{D\left( X \middle| j \right) = \sum\limits_{t} \sum\limits_{s} \left\lbrack \Phi_{t}^{s}(j) + \Psi_{t}^{s}(j) \right\rbrack \eta_{s}\, d_{s}\left( x_{t}^{s} \middle| M_{s}^{j} \right), \quad \text{and}} & (2) \\ {D\left( X \middle| j \right) = \sum\limits_{t} \sum\limits_{s} \Phi_{t}^{s}(j) \times \Psi_{t}^{s}(j)\, \eta_{s}\, d_{s}\left( x_{t}^{s} \middle| M_{s}^{j} \right),} & (3)\end{matrix}$

[0054] where Φ_(t) ^(s)(j) and Ψ_(t) ^(s)(j) are time, stream, and model dependent parameters that respectively measure the local congruence (or coverage or richness) of the test data with respect to the model and the stability (or consistency or predictability) of the score stream, and η_(s) normalizes the scale of the scores. Note that there is a product and sum form of the combination. It is to be understood that Φ measures the match of the test data to the models and Ψ measures the predictability of the score stream. They are the normalized parameters, as will be defined below, derived from φ_(t) ^(s)(j) and ψ_(t) ^(s)(j).
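The two combination forms can be sketched as a single weighted double sum over frames and streams. The sketch below reads equation (2) as the sum form and equation (3) as the product form mentioned in the text, which is an interpretation rather than something the text states outright. The data layout (one dictionary per frame mapping a stream name to a (Φ, Ψ, d) triple) and the function name are assumptions made for illustration.

```python
def fused_discriminant(frames, eta, form="sum"):
    """Accumulate weighted frame scores over all frames and streams.

    frames: list of dicts, one per frame, mapping stream name -> (Phi, Psi, d),
            where d is the base discriminant score for that stream and frame.
    eta:    dict mapping stream name -> scale normalization constant.
    form:   "sum" uses (Phi + Psi) as the weight, anything else uses Phi * Psi.
    """
    total = 0.0
    for frame in frames:
        for stream, (phi, psi, score) in frame.items():
            weight = phi + psi if form == "sum" else phi * psi
            total += weight * eta[stream] * score
    return total


# One toy frame with an audio ("a") and a video ("v") stream.
frames = [{"a": (0.6, 0.5, -2.0), "v": (0.4, 0.5, -1.0)}]
eta = {"a": 1.0, "v": 2.0}
print(fused_discriminant(frames, eta, form="sum"))      # sum form
print(fused_discriminant(frames, eta, form="product"))  # product form
```

In either form a stream whose weights are small at a given frame contributes little to the total for that frame, which is how the fusion shifts between modalities over time.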

[0055] A description of a coverage parameter that may be employed in accordance with an illustrative embodiment of the invention will now be provided.

[0056] To determine φ_(t) ^(s)(j), which is a measure of the coverage of the model by the test data, the roles of the test and training data are inverted and an “inverse” likelihood is computed. That is, for a time t, a neighborhood (in time) of the test vector is modeled by a GMM, M_(s,t) ^(test), and the likelihood of the model parameters M_(s) ^(j), and/or training data (for M_(s) ^(j)), is measured with respect to the test data model in computing the parameter. In its generalized form, the coverage parameter equation is: $\begin{matrix}{\varphi_{t}^{s}(j) = \sum\limits_{i} a_{s,i}^{j}\, d_{s}\left( m_{s,i}^{j} \middle| T_{s}^{j} M_{s,t}^{test} \right)} & (4)\end{matrix}$

[0057] where T_(s) ^(j)M_(s,t) ^(test) denotes transformation of the test model into M_(s) ^(j)'s feature space and a_(s,i) ^(j) is a constant proportional to p_(s,i) ^(j) and |Σ_(s,i) ^(j)|, the determinant of the diagonal matrix in the optimal model feature space. In the sum, i ranges over all the components of M_(s) ^(j).
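The inverse likelihood of equation (4) can be illustrated with a deliberately simplified sketch. The assumptions are substantial and are made only to keep the example short: the test neighborhood is modeled by a single diagonal Gaussian instead of a GMM, the test vectors are taken to be already expressed in the model's transformed feature space, and the constants a are supplied by the caller. All names are hypothetical.

```python
import math


def fit_diag_gaussian(vectors, var_floor=1e-3):
    """Fit a single diagonal Gaussian (mean, variances) to a list of vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # Population variance, floored so that a degenerate window stays usable.
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
           for d in range(dim)]
    return mean, var


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def coverage(model_means, a, neighbourhood):
    """Score the training-model means under a model built from the test window."""
    test_mean, test_var = fit_diag_gaussian(neighbourhood)
    return sum(a_i * log_gaussian_diag(m_i, test_mean, test_var)
               for a_i, m_i in zip(a, model_means))


window = [[0.0], [2.0]]                      # toy test neighbourhood
print(coverage([[1.0]], [1.0], window))      # model mean inside the test data
print(coverage([[10.0]], [1.0], window))     # model mean far from the test data
```

A model component that lies far from everything in the test window pulls the value down, reflecting a part of the model that the test data does not cover.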

[0058] Referring now to FIG. 2, a diagram illustrates the behavior of a coverage parameter computed by an authentication system, according to an embodiment of the present invention. More specifically, FIG. 2 graphically illustrates normalized coverage for 25 models over two different utterances (i.e., Utterance 1 and Utterance 2). Normalization will be described below.

[0059] The major trend over the models can be seen in FIG. 2 to be relatively consistent for the two utterances, which indicates the relative richness of the training data for the 25 models used. However, there are a fair number of models where the values diverge, indicating the variable relative richness of the test data (in the two utterances). The power of this measure lies in the fact that φ is not symmetric with respect to the roles of the training and test data. A high concentration of test data near one model component can yield a high likelihood, yet when the roles are reversed, and the model is built on the test data, the training data means and/or vectors could have a very low likelihood, indicating that the training data is much richer in comparison to the test data. One can associate φ with the fraction of the model covered by the test data.

[0060] A description of a stability parameter that may be employed in accordance with an illustrative embodiment of the invention will now be provided.

[0061] The parameter ψ_(t) ^(s)(j) is computed using the deviation of the score at time t from a point estimate of the score at time t, based on a neighborhood (in time) of test vectors X_(nbhd) ^(s) (the size of this neighborhood is, in general, independent of that used for determining φ). The parameter ψ_(t) ^(s)(j) is a measure of the relative instability of the data stream at time t and is defined as: $\begin{matrix}{\psi_{t}^{s}(j) = \beta_{s,t}^{j}\, \frac{d_{s}\left( x_{t}^{s} \middle| M_{s}^{j} \right) - \mu\left\lbrack d_{s}\left( x^{s} \middle| M_{s}^{j} \right);\ x^{s} \in X_{nbhd}^{s} \right\rbrack}{\sigma\left\lbrack d_{s}\left( x^{s} \middle| M_{s}^{j} \right);\ x^{s} \in X_{nbhd}^{s} \right\rbrack}} & (5)\end{matrix}$

[0062] Note that the value β_(s,t) ^(j) should be chosen to ensure that ψ_(t) ^(s)(j) is positive.
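Equation (5) is a standardized deviation of the current score from its local neighborhood. A minimal sketch follows, under stated assumptions: β is taken to be the sign that makes the result non-negative (so the absolute deviation is used), the point estimate is the neighborhood mean, the spread is the population standard deviation, and a small floor on the spread is added here to avoid division by zero. The function name and the floor value are hypothetical.

```python
from statistics import mean, pstdev


def stability_deviation(score_t, neighbourhood_scores, eps=1e-6):
    """Standardized absolute deviation of score_t from its neighbourhood.

    Assumptions: beta = +1 or -1 chosen so the result is non-negative,
    and sigma is floored at eps so a flat neighbourhood does not divide by zero.
    """
    mu = mean(neighbourhood_scores)
    sigma = max(pstdev(neighbourhood_scores), eps)
    return abs(score_t - mu) / sigma


scores = [1.0, 2.0, 3.0]                  # toy frame scores around time t
print(stability_deviation(3.0, scores))   # score at the edge of the window
print(stability_deviation(2.0, scores))   # score equal to the local mean
```

A larger value marks a frame whose score departs sharply from its neighbors, which is why the later normalization uses the reciprocal of this quantity as a weight.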

[0063] Referring now to FIG. 3, a diagram illustrates the behavior of a stability parameter computed by an authentication system, according to an embodiment of the present invention. More specifically, FIG. 3 graphically illustrates a video stream deviation for a 1000 frame section. While the deviation shown typically hovers closely to a constant value, there are a number of sections where the deviation becomes quite large. The ψ parameter is related to the φ factor in that an unstable score stream can be the result of the differing richness of the training and test data. However, as one does not necessarily imply the other, it is advantageous to use both parameters.

[0064] A description of a normalization operation that may be employed in accordance with an illustrative embodiment of the invention will now be provided.

[0065] It is to be understood that Φ and Ψ are normalized parameters based on φ and ψ and are defined as: $\Phi_{t}^{s}(j) = \varphi_{t}^{s}(j) \Big/ \sum\limits_{q \in \{a,v\}} \varphi_{t}^{q}(j), \qquad \Psi_{t}^{s}(j) = \left( 1/\psi_{t}^{s}(j) \right) \Big/ \sum\limits_{q \in \{a,v\}} \left( 1/\psi_{t}^{q}(j) \right).$

[0066] Such normalization induces a context dependence since the weights on one stream depend on the other. The reciprocal is used in computing Ψ because it is desired that the factor be inversely proportional to the deviation. The η_(s) parameter, incorporating the normalization for the scale differences in the score streams, is set (based on empirical performance) to 1/μ_(s,global), which is the reciprocal of the mean value of the stream elements to be combined, taken over a large enough sample of data so that the mean value is practically constant.
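The normalization step reduces to dividing each stream's value by the total over streams, using reciprocals for the stability term. The sketch below assumes the raw coverage values are positive and adds a small floor on ψ before taking reciprocals; that floor, the dictionary layout, and the function name are illustrative assumptions rather than part of the described method.

```python
def normalize_weights(phi, psi, eps=1e-9):
    """Turn raw per-stream coverage and deviation values into normalized weights.

    phi, psi: dicts mapping stream name -> raw value at one time step.
    Returns (Phi, Psi), each summing to 1 over the streams.
    Assumption: psi is floored at eps so that a zero deviation stays finite.
    """
    phi_total = sum(phi.values())
    inv_psi = {s: 1.0 / max(v, eps) for s, v in psi.items()}
    inv_total = sum(inv_psi.values())
    Phi = {s: v / phi_total for s, v in phi.items()}
    Psi = {s: v / inv_total for s, v in inv_psi.items()}
    return Phi, Psi


# Toy values: audio covers the model better and has the smaller deviation.
Phi, Psi = normalize_weights({"a": 3.0, "v": 1.0}, {"a": 1.0, "v": 3.0})
print(Phi, Psi)
```

Because the sums run over whatever streams are present in the dictionaries, the same function extends unchanged to a third stream.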

[0067] Note that while the explanation above used two streams, the number of streams is not so limited. In the case of more than two streams, the coverage and deviation parameters may be computed for each stream, and the normalization step is implemented such that the sums are over all of the streams. Also, the number of scores added together to form the discriminants is increased by the number of additional data streams.

[0068] Accordingly, in accordance with the above-described parameters and discriminants of the present invention, speaker authentication is carried out by computing equation (2) or (3) for each speaker j, and letting the decision be given by: $id = \arg\max_{j} D\left( X \middle| j \right).$

[0069] Referring now to FIG. 4, a block/flow diagram illustrates an authentication methodology for implementation in an authentication system, according to an embodiment of the present invention. It is to be appreciated that the diagram in FIG. 4 depicts the functional processing blocks and data flow involved in the time-varying stream reliability prediction-based authentication methodology described above. Note that FIG. 4 shows that additional data (e.g., GPS data) may be employed in the methodology. The addition of this other data type is straightforward given the above detailed explanation of how audio and video data streams are processed to make an authentication decision. That is, the use of this additional data type is the same as the use of the audio and video data, i.e., statistical models are built and frame scores are obtained, and the frame scores are then combined according to weightings such as those defined above in the explanation of normalization.

[0070] As shown, FIG. 4 depicts details of authentication system 118 (FIG. 1). Authentication system 118 receives multiple data streams in the form of video data 402, audio data 404 and other data 406 (e.g., GPS data). Recall that these streams are processed to form the feature streams described above. The authentication system 118 comprises the following blocks: score computation module 408, coverage parameter φ_(t) ^(s)(j) and stability parameter ψ_(t) ^(s)(j) computation module 410, normalization module 412, statistical model store 414, and context dependent fusion module 416.

[0071] In accordance with these modules, authentication of a speaker may be performed as follows.

[0072] As shown, multiple streams of data including video data 402, audio data 404 and other data (e.g., GPS data) 406 are simultaneously obtained in association with the individual, i.e., speaker j. This step is illustratively described above. Each stream is provided to both the score computation module 408 and the coverage and stability parameters computation module 410.

[0073] In the score computation module 408, scores are computed for each time interval t (e.g., frame) by comparing the input video data 402, the audio data 404 and the GPS data 406 with the corresponding statistical models previously trained and stored in model store 414 (e.g., audio data to audio models, video data to video models, GPS data to GPS models). While scores associated with audio data and video data are understood to represent likelihoods or probabilities as are well known, GPS models and scores may involve the following. GPS models may be built from an output of a GPS tracking system and represent specific GPS tracking coordinates. With regard to GPS data scoring, when the speaker is at a location at which he normally resides, a higher score will be generated, as compared to a score generated when the speaker is at a location other than a normal or expected location.

[0074] In computation module 410, the coverage parameter φ is computed, as explained above, for each stream s (audio, video, GPS) at time interval t for speaker j, in accordance with the training models stored in block 414. Recall that the coverage parameters represent the relative richness of the training models, i.e., they represent a measure of the coverage of the models by the input test data.
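
By way of illustration only, a coverage-style quantity might be sketched as follows for scalar features. The sketch assumes a single Gaussian fitted to a neighborhood of test values and a training model summarized by a list of component means; these simplifications, the variance floor and the function name are assumptions of the sketch and are not the formulation of the embodiment. The sketch measures how likely the training model parameters are under a model built from the test data, rather than the reverse.

```python
import math
from statistics import fmean, pvariance


def coverage_inverse_likelihood(test_window, training_means, var_floor=1e-3):
    """Average log-likelihood of training-model means under a Gaussian
    fitted to a neighborhood (window) of test values (hypothetical sketch)."""
    mu = fmean(test_window)
    # Floor the variance so a flat window does not produce a degenerate model.
    var = max(pvariance(test_window, mu), var_floor)

    def log_gauss(x):
        return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

    return fmean(log_gauss(m) for m in training_means)
```

Under these assumptions, training parameters that lie close to the current test data yield a larger value than training parameters that lie far from it.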

[0075] Further, in computation module 410, the stability parameter ψ is also computed, as explained above, for each stream s (audio, video, GPS) at time interval t for speaker j. Recall that the stability parameter represents a deviation of a score at time t from a point estimate of the score at time t.
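
By way of illustration only, such a deviation might be sketched as follows. The sketch assumes that the point estimate is the mean of the frame scores in a symmetric window around time t (the window including the frame at t itself) and that the deviation is an absolute difference; the window width and the function name are likewise assumptions of the sketch.

```python
from statistics import fmean


def stability_deviation(scores, t, half_width=2):
    """Absolute deviation of scores[t] from the mean score of its neighborhood."""
    lo = max(0, t - half_width)
    hi = min(len(scores), t + half_width + 1)
    point_estimate = fmean(scores[lo:hi])  # assumed point estimate: window mean
    return abs(scores[t] - point_estimate)
```

A score stream that is locally constant yields a deviation of zero, while an isolated spike yields a large deviation, which may be read as low stability at that frame.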

[0076] In normalization module 412, the coverage and stability parameters are normalized, as explained above, to induce a context dependence. Then, in context dependent fusion module 416, the scores generated in block 408 are summed based on the weights produced in accordance with the normalization operation. Lastly, the scores over all frames are summed based on the weights. The result is considered the result of the computation of the discriminant described above in accordance with equations (2) or (3). Advantageously, using the parameters described herein, the authentication system is able to predictively take into account the time-varying nature of the data streams obtained from the individual.
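
By way of illustration only, the normalization and fusion steps might be sketched as follows. The sketch assumes that, at each frame, the per-stream coverage and stability values are non-negative reliability values (larger meaning more reliable), that normalization divides each value by its sum over the streams, and that the two normalized values are added to form the frame weight. These choices, the data layout and the function names are assumptions of the sketch and do not reproduce equations (2) or (3).

```python
def normalize_over_streams(values):
    """Scale the per-stream values of one frame so that they sum to 1."""
    total = sum(values.values())
    if total == 0:
        # No stream carries information: fall back to equal weights.
        return {s: 1.0 / len(values) for s in values}
    return {s: v / total for s, v in values.items()}


def fused_discriminant(frames):
    """Weighted sum of frame scores over all streams and all frames.

    frames: list of dicts with keys "score", "coverage" and "stability",
    each mapping a stream name to a float (hypothetical layout).
    """
    total = 0.0
    for frame in frames:
        phi = normalize_over_streams(frame["coverage"])
        psi = normalize_over_streams(frame["stability"])
        for stream, score in frame["score"].items():
            total += (phi[stream] + psi[stream]) * score
    return total
```

Because the weights are recomputed at every frame, a stream that is momentarily unreliable contributes less to the discriminant at that frame without being discounted for the whole utterance.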

[0077] It is to be understood that the output of the fusion module 416 may take several forms. However, in one preferred form, the output comprises a composite score (summed in accordance with the weights) for each training model stored in store 414. Based on the score, an identification or verification decision is computed. For example, identification is the process whereby the model with the highest score or result is chosen. Verification compares the score of the model associated with a claimed identity to that of the background model, accepting or rejecting the claim based on the difference. Thus, in addition to the composite score, or as an alternative to the composite score, module 416 may output an authentication decision message (e.g., a message indicating that the subject individual or entity is “authenticated” or “not authenticated”).
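
By way of illustration only, the two decision types might be sketched as follows. The function names, the dictionary of composite scores keyed by speaker, and the zero default threshold are assumptions of the sketch.

```python
def identify(speaker_scores):
    """Identification: choose the model with the highest composite score."""
    return max(speaker_scores, key=speaker_scores.get)


def verify(speaker_scores, background_score, claimed_id, threshold=0.0):
    """Verification: accept the claim if the claimed model's score exceeds
    the background model's score by more than a threshold (assumed rule)."""
    margin = speaker_scores[claimed_id] - background_score
    return "authenticated" if margin > threshold else "not authenticated"
```

In practice the threshold would be tuned to trade off false acceptances against false rejections; the sketch leaves that choice open.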

[0078] Referring now to FIG. 5, a block diagram illustrates an exemplary computing system environment for implementing an authentication system, according to one embodiment of the invention. It is to be understood that an individual or user may interact with the system locally or remotely. In the local case, the individual/user interacts directly with the computing system embodying the system. In the remote case, the individual/user interacts with the computing system (e.g., server) embodying the system via another computing system (e.g., a client or client device), wherein the client and server communicate over a distributed network. The network may be any suitable network across which the computer systems can communicate, e.g., the Internet or World Wide Web, local area network, etc. However, the invention is not limited to any particular type of network. In fact, it is to be understood that the computer systems may be directly linked without a network.

[0079] As shown, the computing system 500 comprises a processor 502, memory 504 and I/O devices 506, all coupled via a computer bus 508. It should be understood that the term “processor” as used herein is intended to include one or more processing devices, including a central processing unit (CPU) or other processing circuitry, e.g., digital signal processor, application-specific integrated circuit, etc. Also, the term “memory” as used herein is intended to include memory associated with a processor or CPU, such as RAM, ROM, a fixed, persistent memory device (e.g., hard drive), or a removable, persistent memory device (e.g., diskette or CDROM). In addition, the term “I/O devices” as used herein is intended to include one or more input devices (e.g., keyboard, mouse) for inputting data to the processing unit, as well as one or more output devices (e.g., CRT display) for providing results associated with the processing unit. Further, the I/O devices associated with the computing system 500 are understood to include those devices and processing equipment necessary to capture and process the particular data associated with an individual/user, as mentioned in detail above with respect to the capturing of the multiple data streams.

[0080] It is also to be understood that the computing system illustrated in FIG. 5 may be implemented in the form of a variety of computer architectures, e.g., a personal computer, a personal digital assistant, a cellular phone, a microcomputer, a minicomputer, etc. However, the invention is not limited to any particular computer architecture.

[0081] Accordingly, software instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices, e.g., ROM, fixed or removable memory, and, when ready to be utilized, loaded into RAM and executed by the CPU.

[0082] A description of some experimental results obtained in accordance with an illustrative embodiment of the invention will now be provided.

[0083] The experiments described below are based on an audio-visual database including 304 speakers. The speech and audio were captured as the users read prompted text while in front of a computer that was equipped with a microphone and camera. For each speaker, approximately 120 seconds of speech was used for training and, on average, the test utterances were 6.7 seconds long with a standard deviation of 2.7 seconds. Experiments were conducted at both the utterance level and frame level. For the frame level experiments, a 100 speaker subset of the data was chosen to reduce computation and storage costs. The total numbers of tests for the full (All) and reduced (100 spkr) sets are 19714 and 7307, respectively.

[0084] Results (authentication rates) are given in FIG. 6 for the cases where the three streams X^(a), X^(v), and X^(av) are used in isolation (there is no score combination or weighting). Recall that X^(av) is the vector-wise concatenation of X^(a) and X^(v). The discriminant in these cases is D(X|j)=Σ_(t)d_(s)(x_(t)^(s)|M_(s)^(j)), where s is either a, v, or av. As can be seen from the results in FIG. 6, vector-wise concatenation can be detrimental. It is evident that the speakers chosen for the reduced 100 spkr set were those for whom good video data existed and who still preserved the base audio error rate. Also, for the sake of comparison, results are given for the case where the streams are weighted with a constant factor for all time, i.e.: $D(X \mid j) = \sum_{t} \left[ w_{a}\, d_{a}\left(x_{t}^{a} \mid M_{a}^{j}\right) + \left(1 - w_{a}\right) d_{v}\left(x_{t}^{v} \mid M_{v}^{j}\right) \right]$.

[0085] The authentication performance on the 100 spkr set is computed on a grid of weights ranging from 0.0 to 1.0. The boundary error rates are the same as in FIG. 6. It may be observed that there is a monotonic increase in accuracy until the fraction of audio goes beyond 0.9, where the performance peaks at 98.9 percent, showing some benefit of adding video to the audio system.
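
By way of illustration only, the constant-weight combination and a sweep of the audio weight over an evenly spaced grid on the unit interval might be sketched as follows. The trial layout, the number of grid steps and the function names are assumptions of the sketch, and the sketch does not reproduce the reported figures.

```python
def constant_weight_discriminant(audio_scores, video_scores, w_a):
    """Sum over frames of w_a * audio score + (1 - w_a) * video score."""
    return sum(w_a * a + (1.0 - w_a) * v
               for a, v in zip(audio_scores, video_scores))


def sweep_audio_weight(trials, steps=10):
    """Identification accuracy for each audio weight on an even grid.

    trials: list of (true_speaker, per_model) pairs, where per_model maps a
    speaker to an (audio_scores, video_scores) tuple (hypothetical layout).
    """
    accuracy = {}
    for k in range(steps + 1):
        w_a = k / steps
        correct = 0
        for true_speaker, per_model in trials:
            best = max(
                per_model,
                key=lambda j: constant_weight_discriminant(*per_model[j], w_a),
            )
            correct += best == true_speaker
        accuracy[w_a] = correct / len(trials)
    return accuracy
```

The endpoints of the grid correspond to the video-only case (audio weight 0.0) and the audio-only case (audio weight 1.0).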

[0086] FIG. 7 focuses on the reduced 100 spkr population and the frame-level combination experiments. In FIG. 7, the effects of using the time and context dependent weights (Φ and Ψ) in isolation and together, using the sum form (equation (2) above), are shown. Using either parameter in isolation is beneficial, but using both together clearly outperforms all cases. If the speakers for whom at least one decision (for one test utterance) changed are considered, 27 speakers whose tests account for 3131 trials are obtained. The improvement for these speakers (over the audio only case) is given in FIG. 7.

[0087] Accordingly, as explained in detail above, the present invention provides a methodology for combining information present in two or more streams of data. The invention realizes that the quality of data and the richness of the testing data relative to the training data vary over time and in fact within the boundaries of an utterance. A notion of data reliability is provided incorporating the stability, or point consistency, of a score stream and the coverage of the model by the test data. Experiments showed that this decision method outperformed the use of audio alone, video alone, or a concatenation of the streams. The results are significant because they are obtained for the clean speech case, for which it was previously questioned whether adding video data could improve performance.

[0088] Furthermore, as explained above, the techniques of the invention can, more generally, be used for a variety of joint processing/decision making operations when two or more data streams are involved. By way of example only, based on a determination of what words are being spoken, an emotional state, etc., the invention can use this information to make buy/sell decisions when one stream is the stock price and the other stream is volume. Also, the data streams may include data other than audio and video, e.g., GPS readings, keyboard strokes, gait, electrocardiogram readings, etc. Still further, the data streams may be “derived data” streams, e.g., a word sequence derived from speech or lip reading, etc. In addition to making decisions, the coverage measure and/or stability measure can be used as a confidence measure to either not make a decision or delay the decision until coverage and/or stability exceeds a threshold value.
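
By way of illustration only, delaying a decision until the measures reach a threshold might be sketched as follows. The sketch assumes per-frame coverage and stability values in which larger means more reliable, requires both measures to reach their thresholds, and uses hypothetical names throughout.

```python
def first_decidable_frame(coverage, stability, coverage_min, stability_min):
    """Index of the first frame at which both measures reach their thresholds,
    or None if no frame qualifies (in which case no decision is made)."""
    for t, (c, s) in enumerate(zip(coverage, stability)):
        if c >= coverage_min and s >= stability_min:
            return t
    return None
```

A caller would compute the decision only from the returned frame onward, or decline to decide when the function returns None.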

[0089] Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

What is claimed is:
 1. A method of generating a decision associated with one of an individual and an entity, the method comprising the steps of: capturing two or more data streams associated with one of the individual and the entity; computing at least one time-varying measure in accordance with the two or more data streams; and computing a decision based on the at least one time-varying measure.
 2. The method of claim 1, wherein the at least one time-varying measure comprises a measure of the coverage of a model associated with previously-obtained training data by at least a portion of the captured data.
 3. The method of claim 2, wherein the coverage measure is determined in accordance with an inverse likelihood computation.
 4. The method of claim 3, wherein the inverse likelihood computation comprises modeling, for a time t, a neighborhood of a test vector associated with the captured data to generate a test data model, and measuring the likelihood of one or more parameters of the training data model with respect to the test data model.
 5. The method of claim 4, wherein the feature space associated with the test data model is transformed into the feature space associated with the training data model.
 6. The method of claim 1, wherein the at least one time-varying measure comprises a measure of the stability of at least a portion of the captured data.
 7. The method of claim 6, wherein the stability measure is determined in accordance with a deviation computation.
 8. The method of claim 7, wherein the deviation computation comprises computing a deviation of a score at time t from a point estimate of the score at time t based on a neighborhood of test vectors associated with the captured data.
 9. The method of claim 1, wherein the at least one time-varying measure is normalized.
 10. The method of claim 9, wherein the normalization induces a context dependence.
 11. The method of claim 1, wherein the measure computation step further comprises computing a second time-varying measure associated with the two or more data streams.
 12. The method of claim 11, wherein the at least one time-varying measure and the second time-varying measure comprise a coverage measure and a stability measure.
 13. The method of claim 12, wherein the measure computation step further comprises the steps of: computing the coverage measure and the stability measure for each data stream; normalizing the coverage measure and the stability measure over the two or more data streams; and forming discriminants by combining scores obtained from the captured data based on the normalized measures, the discriminants being used to compute the decision.
 14. The method of claim 1, wherein the decision is not computed until the at least one time-varying measure is one of greater than and equal to a given threshold value.
 15. The method of claim 1, wherein the decision is a speaker authentication decision.
 16. The method of claim 1, wherein the two or more data streams comprise two or more of audio data, video data, and other data relating to one of the individual and the entity.
 17. The method of claim 1, wherein the two or more data streams comprise data derived from the data associated with one of the individual and the entity.
 18. Apparatus for generating a decision associated with one of an individual and an entity, the apparatus comprising: a memory; and at least one processor coupled to the memory and operative to: (i) capture two or more data streams associated with one of the individual and the entity; (ii) compute at least one time-varying measure in accordance with the two or more data streams; and (iii) compute a decision based on the at least one time-varying measure.
 19. The apparatus of claim 18, wherein the at least one time-varying measure comprises a measure of the coverage of a model associated with previously-obtained training data by at least a portion of the captured data.
 20. The apparatus of claim 18, wherein the at least one time-varying measure comprises a measure of the stability of at least a portion of the captured data.
 21. The apparatus of claim 18, wherein the at least one time-varying measure is normalized.
 22. The apparatus of claim 18, wherein the measure computation operation further comprises computing a second time-varying measure associated with the two or more data streams.
 23. The apparatus of claim 18, wherein the decision is not computed until the at least one time-varying measure is one of greater than and equal to a given threshold value.
 24. The apparatus of claim 18, wherein the decision is a speaker authentication decision.
 25. The apparatus of claim 18, wherein the two or more data streams comprise two or more of audio data, video data, and other data relating to one of the individual and the entity.
 26. The apparatus of claim 18, wherein the two or more data streams comprise data derived from the data associated with one of the individual and the entity.
 27. An article of manufacture for generating a decision associated with one of an individual and an entity, comprising a machine readable medium containing one or more programs which when executed implement the steps of: capturing two or more data streams associated with one of the individual and the entity; computing at least one time-varying measure in accordance with the two or more data streams; and computing a decision based on the at least one time-varying measure.
 28. The article of claim 27, wherein the at least one time-varying measure comprises a measure of the coverage of a model associated with previously-obtained training data by at least a portion of the captured data.
 29. The article of claim 27, wherein the at least one time-varying measure comprises a measure of the stability of at least a portion of the captured data.
 30. The article of claim 27, wherein the at least one time-varying measure is normalized.